class: center, middle, inverse, title-slide

# Lecture 6
## Model Evaluation
### Psych 10 C
### University of California, Irvine
### 04/11/2022

---

## Summary

- Last week we started with two different hypotheses about the relation between lung capacity and smoking status.

--

- The Null model stated that there were no differences in lung capacity as a function of smoking status. The model was formalized as:

`$$y_{ij}\sim\text{Normal}(\mu,\sigma^2)$$`

--

- We found two estimators for the parameters in the model: `\(\hat{\mu}\)`, which is the average of the participants' lung capacity regardless of smoking status.

--

- And `\(\hat{\sigma}^2_0\)`, which is the average error or variability of our observations when we use `\(\hat{\mu}\)` as a prediction of each observation.

--

- Finally, we said that we would be interested in the Sum of Squared Errors of the Null model, which is defined as:

`$$SSE_0 = \sum_j \sum_i \left(y_{ij}-\hat{\mu}\right)^2$$`

---

## Summary

- Our second hypothesis was the Effects model, which assumes that there is a difference in lung capacity as a function of smoking status. This model is formalized as:

--

`$$y_{ij}\sim\text{Normal}(\mu_j,\sigma_e^2)$$`

--

- The estimator of our parameter `\(\mu_j\)` (one for each group, `\(\hat{\mu}_j\)`) was equal to the average of each group (taken independently).

--

- The estimator for `\(\sigma_e^2\)` was equal to the error of the model, which is the average squared difference between each observation and the model's prediction for that observation, `\(\hat{\mu}_j\)`.

--

- Finally, we mentioned that we will also be interested in the Sum of Squared Errors of the Effects model, which is defined as:

`$$SSE_e = \sum_j \sum_i \left(y_{ij}-\hat{\mu}_j\right)^2$$`

---

## Adding predictions

- The first thing we want to do is add our predictions and the squared error of each observation to our data.
--

```r
# total sample size
n_total <- nrow(smokers)

# get the prediction of the null model
null_pred <- smokers %>%
  summarise("pred" = mean(lung_capacity)) %>%
  pull(pred)

# get the predictions of the effects model (\hat{\mu}_j)
eff_pred <- smokers %>%
  group_by(smoke_status) %>%
  summarise("prediction" = mean(lung_capacity))

# add predictions to the data
smokers <- smokers %>%
  mutate("pred_null" = rep(x = null_pred, times = n_total),
         "pred_eff" = ifelse(test = smoke_status == "non_smoker",
                             yes = eff_pred$prediction[1],
                             no = eff_pred$prediction[2]))
```

---

## Adding errors

- Now, to add the squared errors, we take the difference between each observation and its prediction and square it:

```r
smokers <- smokers %>%
  mutate("error_null" = (lung_capacity - pred_null)^2,
         "error_eff" = (lung_capacity - pred_eff)^2)
```

--

- Now our data file has the relevant variables for each observation:
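--

- The rendered slides display the updated table here. As a self-contained illustration, here is the same pipeline in base R on a small **hypothetical** data set (the numbers below are made up for illustration; they are not the real smokers data):

```r
# hypothetical data standing in for the real smokers file
smokers <- data.frame(
  smoke_status  = rep(c("non_smoker", "smoker"), each = 4),
  lung_capacity = c(9.1, 8.4, 9.8, 8.7, 4.2, 5.1, 3.9, 4.8)
)

# null-model prediction: one grand mean for everyone
smokers$pred_null <- mean(smokers$lung_capacity)

# effects-model prediction: the mean of each group
smokers$pred_eff <- ave(smokers$lung_capacity, smokers$smoke_status)

# squared error of each observation under each model
smokers$error_null <- (smokers$lung_capacity - smokers$pred_null)^2
smokers$error_eff  <- (smokers$lung_capacity - smokers$pred_eff)^2

head(smokers)
```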
---

## Sum of Squared Errors

- Using the updated table, it's easy to get the values of the SSE and the estimators `\(\hat{\sigma}_0^2\)` and `\(\hat{\sigma}_e^2\)`.

```r
# the SSE of the null model is:
sse_0 <- sum(smokers$error_null)

# the SSE of the effects model is:
sse_e <- sum(smokers$error_eff)

# mean squared error, null model
sigma_0 <- 1/n_total * sse_0

# mean squared error, effects model
sigma_e <- 1/n_total * sse_e
```

--

Their values are:

.pull-left[
Null model

`\(SSE_0\)` = 252.36

`\(\hat{\sigma}_0^2\)` = 31.54
]

.pull-right[
Effects model

`\(SSE_e\)` = 46.31

`\(\hat{\sigma}_e^2\)` = 5.79
]

---

class: inverse, center, middle

# Model Evaluation
## `\(R^2\)`

---

## Model Evaluation: `\(R^2\)`

- `\(R^2\)` is a method that allows us to "measure" how good a general model is in comparison to a **nested** model.

--

- In our problem, the Null model is nested in the Effects model. But what does it mean for a model to be nested?

--

- We say that one model is nested in another when the parameter values of the **nested** model are a special case of the **general** model.

--

- For example, in our Effects model there is nothing that makes it impossible for `\(\mu_1\)` to be equal to `\(\mu_2\)`, which would be the same as saying that the expectations of the two groups are equal...

--

- But that was the assumption of the **Null model**!

--

- This means that the Null model is nested in the Effects model; in other words, the **Null model** is a special case of the **Effects model**.

---

## Model Evaluation: `\(R^2\)`

- What does it mean for our results that the Null model is nested in the Effects model?

--

- It means that, in terms of error or variability, the Effects model will always have an error lower than or equal to that of the Null model.

--

- In other words, the worst the Effects model can do is to match the Null model; anything else makes the Effects model better.
--

- We can express this with the following equation:

`$$SSE_0 = SSE_a + SSE_e$$`

--

- Where the new variable `\(SSE_a\)` represents the error or variation that is reduced, or accounted for, when we use the Effects model in comparison to the Null model.

--

- Given that `\(SSE_e\)` is always **lower than or equal to** `\(SSE_0\)`, we can express the error accounted for by the Effects model as:

`$$SSE_a = SSE_0 - SSE_e$$`

---

## Model Evaluation: `\(R^2\)`

- `\(R^2\)` can be interpreted as the proportion of error accounted for by the Effects model out of the total error.

--

- The total error is the error of the Null model, and the error accounted for is `\(SSE_a\)`. Therefore, we can express the proportion of accounted error `\((R^2)\)` as:

`$$R^2 = \frac{SSE_a}{SSE_0} = \frac{SSE_0 - SSE_e}{SSE_0}$$`

--

- By definition we have that:

`$$0 \leq R^2 \leq 1$$`

---

## Model Evaluation: `\(R^2\)`

- When `\(R^2 = 0\)`, there is no difference between the error (variance) of the two models.

--

- When `\(R^2 = 1\)`, the error (variance) of the Effects model is 0 (which can't happen unless our observations have no variability).

--

- This means that the closer the value of `\(R^2\)` is to 1, the better the Effects model is!

--

- Whenever you see a journal paper that uses a linear model, you are likely to encounter a value of `\(R^2\)`, defined as the proportion of error accounted for by the Effects model (or linear model).

--

- Let's calculate the proportion of error accounted for in our smokers example.

---

## `\(R^2\)`: Smokers Experiment

- Using the Sums of Squared Errors that we calculated before, we can obtain the proportion of error accounted for by the Effects model.

--

- Go back to slide 6 and, using the values there, calculate the proportion of error accounted for by the Effects model in the smokers data set.
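--

- Once you have an answer, you can check it with a quick computation; a minimal R sketch using the `\(SSE\)` values reported earlier for the smokers example:

```r
# R^2 = (SSE_0 - SSE_e) / SSE_0, using the SSE values
# reported earlier for the smokers example
sse_0 <- 252.36
sse_e <- 46.31
r_squared <- (sse_0 - sse_e) / sse_0
round(r_squared, 2)  # about 0.82
```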
--

<br>

.can-edit.key-likes[
`\(R^2 =\)`
]

---

## `\(R^2\)`

- `\(R^2\)` gives us a first look at how the Effects model is doing in a given situation.

--

- However, there is a problem with `\(R^2\)` that can't be avoided...

--

- Remember that we said that the error (variance) of the Effects model will always be equal to or lower than that of the Null model.

--

- This is because using two distributions instead of one gives us more flexibility.

--

- In other words, we can always do at least as well using two normal distributions as with the single one in the Null model.

---

class: inverse, middle, center

# Model fit and Model complexity

---

## Model fit vs Model complexity

- If the error of the Effects model is always lower, do we always want to conclude that the Null model is inappropriate?

--

- This is an **essential** problem for the model comparison approach (and for Statistics in general).

--

- How should we balance the reduction of error (AKA **model fit**) against the flexibility of our models or hypotheses (AKA **model complexity**)?

--

- The concept of **model fit** is widely used in statistics. Intuitively, it refers to any "measure" of the error in the predictions of our models.

--

- For example, the Sum of Squared Errors (both for the Null and the Effects model) can be considered a "measure" of **model fit**, as it summarizes, in a single number, the amount of error between our observations and the predictions.

---

## SSE as a measure of Model fit

- We know that, by definition, the SSE associated with the Null model will always be greater than (or equal to) the SSE of the Effects model.

--

- In other words, the Effects model will always **fit** the data better.

--

- However, we know that one of the reasons why this is the case is that the Effects model is more flexible.

--

- We refer to this flexibility as **model complexity**, and in general, we would like a way to assign a number to it, similar to what the SSE does for model fit.
--

- If we can assign numbers to both concepts, then we will be able to decide which model is more appropriate, taking into account both how well the model fits the data (comparing SSE) and how flexible the model is (model complexity).

---

class: inverse, middle, center

# Model Evaluation
## Bayesian Information Criterion (BIC)

---

## Model fit

- The Bayesian Information Criterion is a general approach that allows us to weigh a model's fit against its complexity.

--

- It uses the natural logarithm of the Mean Squared Error ( `\(\hat{\sigma}_0^2\)` and `\(\hat{\sigma}_e^2\)` ) of the Null and Effects models as a measure of fit.

--

`$$\text{fit} = n\ \ln \left(\hat{\sigma}^2\right)$$`

--

- Where `\(n\)` represents the total number of observations.

--

- The use of the natural logarithm comes from the mathematical derivation of the BIC. However, the interpretation is similar.

--

- In other words, if `\(\hat{\sigma}^2\)` increases, so does `\(\ln(\hat{\sigma}^2)\)`.

--

- So we can interpret this number as "how badly does the model fit the data".

---

## Model complexity

- Model complexity is measured as a function of the number of parameters of a model associated with its predictions.

--

- For example, the Null model uses a single prediction, `\(\hat{\mu}\)`, so we count it as having one parameter.

--

- The Effects model has two predictions, one for each group ( `\(\hat{\mu}_1\)` and `\(\hat{\mu}_2\)` ), so we say that it has two parameters.

--

`$$\text{complexity} = k\ \ln\left(n \right)$$`

--

- Where `\(k\)` represents the number of parameters (1 for the Null model, 2 for the Effects model) and `\(n\)` is the number of observations (participants).

--

- The value of the complexity term increases as the number of parameters increases.

--

- In other words, this function will always assign a larger complexity to the Effects model than to the Null model.
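--

- With `\(n = 8\)` participants, as in the smokers example, the complexity terms can be computed directly; a minimal R sketch:

```r
# complexity term k * ln(n) for each model,
# with n = 8 participants as in the smokers example
n <- 8
complexity_null <- 1 * log(n)  # k = 1 parameter
complexity_eff  <- 2 * log(n)  # k = 2 parameters
round(c(complexity_null, complexity_eff), 2)  # 2.08 and 4.16
```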
---

## BIC

- The Bayesian Information Criterion (**BIC**) is defined as the sum of the "badness of fit" and the model complexity.

--

`$$BIC = \text{fit } + \text{ complexity} = n\ \ln \left(\hat{\sigma}^2\right)\ +\ k\ \ln\left(n \right)$$`

--

- Therefore, the BIC of the Null model is equal to:

`$$BIC_0 = n\ \ln \left(\hat{\sigma}_0^2\right)\ +\ \ln\left(n \right)$$`

--

- And the BIC of the Effects model is equal to:

`$$BIC_e = n\ \ln \left(\hat{\sigma}_e^2\right)\ +\ 2\ \ln\left(n \right)$$`

---

## Example: Smoking

- In our smoking example, the BIC value for the Null model was:

`$$BIC_0 = 8\ \ln \left(31.54\right)\ +\ \ln\left(8 \right) = 29.69$$`

--

- For the Effects model, the BIC was equal to:

`$$BIC_e = 8\ \ln \left(5.79\right)\ +\ 2\ \ln\left(8 \right) = 18.20$$`

--

- This means that the improvement in our predictions of lung capacity more than offsets the added flexibility of the Effects model (remember that it uses two normal distributions instead of one).

--

- In general, we should always choose the model with the lowest **BIC**!
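---

## BIC in R

- The BIC comparison can be reproduced directly from the mean squared errors; a minimal R sketch using the rounded values from the earlier slide (so the results may differ from the slide values in the second decimal):

```r
# BIC = n * ln(sigma_hat^2) + k * ln(n), using the
# rounded mean squared errors from the smokers example
n <- 8
sigma_0 <- 31.54  # null model, k = 1
sigma_e <- 5.79   # effects model, k = 2

bic_0 <- n * log(sigma_0) + 1 * log(n)
bic_e <- n * log(sigma_e) + 2 * log(n)

round(c(bic_0, bic_e), 2)  # close to 29.69 and 18.20
```

The model with the lower BIC is preferred, so here the Effects model wins despite its extra parameter.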